Skip to content

[CELEBORN-2327] Add active-slot weight to load-aware placement#3685

Closed
sunchao wants to merge 3 commits into
apache:mainfrom
sunchao:dev/chao/codex/active-slots-placement-oss
Closed

[CELEBORN-2327] Add active-slot weight to load-aware placement#3685
sunchao wants to merge 3 commits into
apache:mainfrom
sunchao:dev/chao/codex/active-slots-placement-oss

Conversation

@sunchao
Copy link
Copy Markdown
Member

@sunchao sunchao commented May 13, 2026

Why are the changes needed?

Celeborn load-aware slot placement currently orders candidate disks using flush and fetch timing only. That can still keep assigning new partitions onto disks that already carry a large amount of reserved active-slot pressure, which makes placement skew worse under overlapping shuffle-heavy workloads. CELEBORN-2327 tracks this gap.

What changes were proposed in this PR?

  • Add an optional, default-off celeborn.master.slot.assign.loadAware.activeSlotsWeight config.
  • Include activeSlots * activeSlotsWeight in load-aware disk ordering.
  • Thread the new config through the master allocation path.
  • Document the new tuning knob and update the slot-allocation developer docs.
  • Add a regression test showing that, when configured, lower-active-slot disks are preferred over otherwise equivalent disks.

How was this PR tested?

  • UPDATE=1 build/mvn clean test -pl common -am -Dtest=none -DwildcardSuites=org.apache.celeborn.ConfigurationSuite
  • ./build/mvn -pl master -am -Dtest=SlotsAllocatorSuiteJ -DwildcardSuites=org.apache.celeborn.NoSuchSuite -DfailIfNoTests=false test
  • ./build/mvn -pl master -am -DskipTests test-compile

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new optional celeborn.master.slot.assign.loadAware.activeSlotsWeight config (default 0.0) so load-aware slot placement can also account for each disk's currently-reserved active slots when sorting candidate disks, helping reduce placement skew under overlapping shuffle-heavy workloads.

Changes:

  • Define and thread the new activeSlotsWeight config from CelebornConf through Master.scala into SlotsAllocator.offerSlotsLoadAware / placeDisksToGroups, where it is added to the existing flush/fetch-time sort key.
  • Document the new tuning knob in docs/configuration/master.md and update docs/developers/slotsallocation.md to reflect the updated ordering formula.
  • Add a regression test in SlotsAllocatorSuiteJ that verifies, with activeSlotsWeight=1 and flush/fetch weights 0, the lower-active-slot worker is preferred over an otherwise equivalent overloaded worker.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file
File Description
common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala Adds MASTER_SLOT_ASSIGN_LOADAWARE_ACTIVE_SLOTS_WEIGHT config entry and accessor.
master/src/main/scala/org/apache/celeborn/service/deploy/master/Master.scala Reads new config and passes it into offerSlotsLoadAware.
master/src/main/java/org/apache/celeborn/service/deploy/master/SlotsAllocator.java Adds activeSlotsWeight parameter and incorporates activeSlots * activeSlotsWeight into disk comparator.
master/src/test/java/org/apache/celeborn/service/deploy/master/SlotsAllocatorSuiteJ.java Adds regression test and threads new parameter through existing helper call sites.
docs/configuration/master.md Documents new config row.
docs/developers/slotsallocation.md Updates ordering formula description.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Member

@SteNicholas SteNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for contribution.

Comment thread docs/developers/slotsallocation.md
Comment thread common/src/main/scala/org/apache/celeborn/common/CelebornConf.scala Outdated
@SteNicholas SteNicholas requested a review from pan3793 May 15, 2026 09:30
@SteNicholas
Copy link
Copy Markdown
Member

@sunchao, please use UPDATE=1 build/mvn clean test -pl common -am -Dtest=none -DwildcardSuites=org.apache.celeborn.ConfigurationSuite to update master.md for CI failure.

@sunchao
Copy link
Copy Markdown
Member Author

sunchao commented May 18, 2026

@SteNicholas Updated

@SteNicholas
Copy link
Copy Markdown
Member

Thanks. Merged to main(v0.7.0).

@sunchao
Copy link
Copy Markdown
Member Author

sunchao commented May 18, 2026

Thanks @SteNicholas @pan3793 for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants